The mediocre teacher tells. The great teacher inspires.
Training multi-modal large language models (MLLMs) that align with human intentions is a long-standing challenge. Traditional score-only reward models for alignment suffer from low accuracy, weak generalization, and poor interpretability, blocking the progress of alignment methods such as reinforcement learning from human feedback (RLHF). This paper introduces Generative RLHF-V, a novel alignment framework that integrates Generative Reward Models (GRMs) with multi-modal RLHF. We propose a two-stage pipeline: generative reward modeling from multi-modal preference, where RL guides GRMs to actively capture human intention and then predict correct pair-wise scores; and RL optimization from grouped comparison, which enhances multi-modal RL scoring precision through grouped response comparison. Experimental results demonstrate that our framework improves the performance of 4 MLLMs across 7 benchmarks by 18.1% on average, whereas baseline RLHF yields only 5.3%. We further validate the out-of-distribution generalization of GRMs and the scaling trend of grouped comparison. Additionally, we investigate GRMs' susceptibility to reward hacking in an overfitting setting. Our findings indicate that MLLMs adopt self-praising behaviors to deceptively obtain high rewards from GRMs. Notably, this deceptive behavior also misleads MLLM-as-judge benchmarks whose scoring is analogous to that of GRMs. Our code, models, and evaluation details can be found at https://generative-rlhf-v.github.io/.
Learning principles from human preference is a major challenge in AI alignment. In MLLM alignment, traditional RLHF methods learn only scalar scores from preferences. In contrast, our Generative RLHF-V learns principles from preferences and optimizes based on a more comprehensive comparison. Experimental results show that Generative RLHF-V elevates 2B and 3B MLLMs to 7B-level performance across 7 benchmarks. It also advances pretrained models to instruct-model capabilities and enables open-source models to match closed-source experts.
"The mediocre teacher tells. The great teacher inspires."
We propose Generative RLHF-V(ision), as shown in Figure 2, a novel alignment framework integrating a vision GRM with RL fine-tuning. Our pipeline consists of two stages: generative reward modeling from multi-modal preference and RL optimization from grouped comparison. Our reward model extends the self-principled critique tuning (SPCT) pipeline to the vision scenario, training MLLMs as GRMs with RL, using rule-based rewards derived from the annotated ground truth in preference datasets. In contrast to the findings of SPCT, we find that in the multi-modal scenario, letting GRMs autonomously explore principles from preferences yields better generalization than selecting principles from a reference set. Our RL optimization uses GRMs to conduct pairwise competitive scoring over the n responses within each response group, taking each response's average score as its RL optimization objective, as sketched below.
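As a concrete illustration of this grouped comparison, here is a minimal sketch in Python. It assumes a hypothetical `grm_pairwise_score` helper that prompts the trained GRM-V with the image, the question, and a pair of responses and parses one score per response; the prompt template, the score parsing, and the choice to evaluate both orderings of each pair are illustrative assumptions rather than the exact implementation.

```python
from itertools import permutations
from statistics import mean

def grouped_comparison_rewards(image, prompt, responses, grm_pairwise_score):
    """Assign each response the average of its pairwise GRM scores.

    `grm_pairwise_score(image, prompt, resp_a, resp_b)` is a hypothetical
    helper that queries the GRM-V once and returns (score_a, score_b).
    """
    collected = {i: [] for i in range(len(responses))}
    # Ordered pairs: each response is scored against every other response
    # in both positions, which helps average out position bias.
    for i, j in permutations(range(len(responses)), 2):
        score_i, score_j = grm_pairwise_score(image, prompt, responses[i], responses[j])
        collected[i].append(score_i)
        collected[j].append(score_j)
    # Each response's mean score is used as its reward in the RL update (e.g., GRPO).
    return [mean(collected[i]) for i in range(len(responses))]
```

Note that evaluating all ordered pairs costs n(n-1) GRM calls per group, which is the trade-off behind the scaling trend over n discussed later.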
Comparison of our pipelines to traditional ones. For reward modeling, we have the generative RM actively reason about the advantages and disadvantages of two answers and output corresponding scores; if the better response receives a higher score, the GRM gets a positive reward. For RL optimization, we compare responses in pairs within a group to obtain more accurate scores.
The Generative RLHF-V pipeline consists of two main parts: generative reward modeling from reinforcement learning (RL) and RL optimization from grouped comparison. The former refers to training an MLLM through RL as a vision generative reward model, i.e., GRM-V, which actively reasons about the human principle behind two given responses and provides a pair-wise score comparison. The latter leverages this property of GRM-V, collecting multiple responses for a given input and providing more accurate grouped scoring for them.
An example of generative reward modeling from RL. The goal of RL is to make MLLMs assign higher scores to responses that align with human preferences. Through RL optimization, MLLMs can infer the underlying principle behind how humans annotate these binary preferences.
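The rule-based reward used to train the GRM can be sketched as below. This is a minimal illustration assuming the GRM's generated critique ends with score lines such as `Score A: 8` and `Score B: 6`; the tag format and parsing are hypothetical, not the exact training template.

```python
import re

def grm_rule_based_reward(grm_output: str, preferred: str) -> float:
    """Return 1.0 if the GRM's critique scores the human-preferred response higher.

    `preferred` is "A" or "B", taken from the annotated preference dataset.
    The "Score A: <number>" line format is an illustrative assumption.
    """
    scores = {}
    for tag in ("A", "B"):
        match = re.search(rf"Score {tag}:\s*(-?\d+(?:\.\d+)?)", grm_output)
        if match is None:
            return 0.0  # unparsable critiques receive no reward
        scores[tag] = float(match.group(1))
    rejected = "B" if preferred == "A" else "A"
    return 1.0 if scores[preferred] > scores[rejected] else 0.0
```

Optimizing the GRM against this reward with an RL algorithm such as GRPO encourages it to articulate a principle and critique before committing to a pair-wise score.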
An example of RL from grouped comparison. Its advantage lies in utilizing grouped comparisons to achieve more accurate scoring. Response B provides accurate and comprehensive information, thus receiving the highest score; although response A is somewhat arbitrary, it performs accurate image recognition and obtains a higher score than C and D.
Models | w/ GC | w/o GC |
---|---|---|
GRM | 0.41 | 0.38 |
GRM+SFT | 0.37 | 0.33 |
GRM+RL | 0.43 | 0.37 |
GPT-4o (Expert) | 0.48 | 0.46 |
The scoring distribution of the GRM+RL model on MLLM-as-a-Judge's Score task. Panel (a) shows the annotated human scores, panel (b) the GRM+RL scores, and panel (c) its fine-grained score distribution.
Comparison of RM accuracy on OOD discriminative tasks. (P) denotes concatenating the annotation principle from the corresponding preference dataset to the model's output, serving as a hint for inference. All models represented by the bar charts were trained on the Align-Anything dataset. The purple dashed line indicates expert performance.
Model | Feedback | MIA-Bench | LLaVA-Wild | LLaVA-Wilder | MM-Safety | MSS-Bench | MM-Vet | MM-Vet-v2 |
---|---|---|---|---|---|---|---|---|
Qwen2-VL-2B | N/A | 45.31 | 61.46 | 47.18 | 38.12 | 46.98 | 32.12 | 27.15 |
+ DPO | RM | 51.04 (+5.73) | 75.91 (+14.45) | 48.12 (+0.94) | 67.21 (+29.09) | 49.52 (+2.54) | 31.28 (-0.84) | 31.28 (+4.13)
+ PPO | RM | 43.72 (-1.59) | 73.79 (+12.33) | 41.32 (-5.86) | 59.83 (+21.71) | 47.38 (+0.40) | 33.56 (+1.44) | 30.79 (+3.64)
+ GRPO | RM | 44.59 (-0.72) | 69.87 (+8.41) | 39.48 (-7.70) | 69.27 (+31.15) | 48.12 (+1.14) | 29.15 (-2.97) | 31.74 (+4.59)
+ GRPO | GRM | 46.81 (+1.50) | 78.51 (+17.05) | 45.01 (-2.17) | 72.53 (+34.41) | 51.45 (+4.47) | 34.97 (+2.85) | 36.36 (+9.21)
+ GRPO | GRM + SFT | 48.57 (+3.26) | 81.87 (+20.41) | 53.04 (+5.86) | 74.56 (+36.44) | 50.98 (+4.00) | 36.78 (+4.66) | 37.14 (+9.99)
+ GRLHF-V (Ours) | GRM + RL | 53.13 (+7.82) | 92.54 (+31.08) | 62.84 (+15.66) | 80.67 (+42.55) | 53.87 (+6.89) | 41.25 (+9.13) | 45.16 (+18.01)
Qwen2.5-VL-3B-Instruct | N/A | 68.01 | 89.63 | 63.65 | 41.18 | 49.58 | 59.16 | 44.94 |
+ DPO | RM | 74.37 (+6.36) | 91.05 (+1.42) | 66.71 (+3.06) | 75.64 (+34.46) | 52.57 (+2.99) | 55.72 (-3.44) | 45.41 (+0.47)
+ PPO | RM | 72.59 (+4.58) | 93.76 (+4.13) | 65.73 (+2.08) | 71.25 (+30.07) | 50.03 (+0.45) | 60.08 (+0.92) | 48.92 (+3.98)
+ GRPO | RM | 69.82 (+1.81) | 93.94 (+4.31) | 66.41 (+2.76) | 69.83 (+28.65) | 51.96 (+2.38) | 56.92 (-2.24) | 47.55 (+2.61)
+ GRPO | GRM | 75.56 (+7.55) | 92.19 (+2.56) | 67.18 (+3.53) | 75.98 (+34.80) | 57.66 (+8.08) | 57.37 (-1.79) | 49.15 (+4.21)
+ GRPO | GRM + SFT | 74.17 (+6.16) | 96.73 (+7.10) | 71.07 (+7.42) | 72.45 (+31.27) | 58.83 (+9.25) | 59.27 (+0.11) | 51.52 (+6.58)
+ GRLHF-V (Ours) | GRM + RL | 79.67 (+11.66) | 103.41 (+13.78) | 68.46 (+4.81) | 78.88 (+37.70) | 62.33 (+12.75) | 62.18 (+3.02) | 55.18 (+10.24)
Qwen2-VL-7B | N/A | 52.58 | 81.30 | 61.80 | 31.95 | 48.23 | 60.32 | 52.98
+ DPO | RM | 57.01 (+4.43) | 81.49 (+0.19) | 59.75 (-2.05) | 81.59 (+49.64) | 49.87 (+1.64) | 60.98 (+0.66) | 53.09 (+0.11)
+ PPO | RM | 55.76 (+3.18) | 83.06 (+1.76) | 62.23 (+0.43) | 80.87 (+48.92) | 50.08 (+1.85) | 57.83 (-2.49) | 52.12 (-0.86)
+ GRPO | RM | 56.89 (+4.31) | 81.25 (-0.05) | 60.19 (-1.61) | 83.14 (+46.19) | 51.98 (+3.75) | 56.85 (-3.47) | 48.96 (-4.02)
+ GRPO | GRM | 59.72 (+7.14) | 86.12 (+4.82) | 68.30 (+6.50) | 81.42 (+49.47) | 50.21 (+1.98) | 57.98 (-2.34) | 54.49 (+1.51)
+ GRPO | GRM + SFT | 59.87 (+7.29) | 92.91 (+11.61) | 65.67 (+3.87) | 87.27 (+55.32) | 52.75 (+4.52) | 58.79 (-1.53) | 56.39 (+3.41)
+ GRLHF-V (Ours) | GRM + RL | 62.31 (+9.73) | 103.55 (+22.25) | 71.98 (+10.18) | 91.96 (+60.01) | 54.83 (+6.60) | 63.92 (+3.60) | 59.11 (+6.13)
Qwen2.5-VL-7B-Instruct | N/A | 74.26 | 97.05 | 71.56 | 50.67 | 51.96 | 68.32 | 67.23 |
+ DPO | RM | 81.55 (+7.29) | 103.34 (+6.29) | 72.08 (+0.52) | 75.09 (+24.42) | 52.72 (+0.76) | 67.84 (-0.48) | 66.98 (-0.25)
+ PPO | RM | 73.12 (-1.14) | 101.62 (+4.57) | 67.89 (-3.67) | 76.59 (+25.92) | 51.29 (-0.67) | 67.89 (-0.43) | 64.23 (-3.00)
+ GRPO | RM | 75.75 (+1.49) | 101.65 (+4.60) | 68.89 (-2.67) | 68.26 (+17.59) | 52.53 (+0.57) | 66.85 (-1.47) | 67.76 (+0.53)
+ GRPO | GRM | 71.88 (-2.38) | 109.12 (+12.07) | 73.32 (+1.76) | 65.88 (+15.21) | 53.12 (+1.16) | 65.50 (-2.82) | 65.08 (-2.15)
+ GRPO | GRM + SFT | 76.23 (+1.97) | 103.50 (+6.45) | 72.15 (+0.59) | 70.23 (+19.56) | 54.08 (+2.12) | 64.93 (-3.39) | 68.12 (+0.89)
+ GRLHF-V (Ours) | GRM + RL | 79.86 (+5.60) | 113.71 (+16.66) | 76.04 (+4.48) | 74.91 (+24.24) | 59.74 (+7.78) | 72.94 (+4.62) | 71.86 (+4.63)
Scaling trend of RL performance with the number of candidate responses n, where GC denotes grouped comparison. It reveals that integrating GC and RL with the GRM framework significantly enhances RL performance across various settings of n. Moreover, this improvement becomes more pronounced as n increases.
The reward hacking behavior manifested by GRLHF-V and its associated quantitative performance, under conditions of overfitting in both reward modeling and RL training.
Benchmarks | w/ P | w/o P |
---|---|---|
Align-Anything | 0.83 | 0.79 (-0.04)
Beaver-V | 0.73 | 0.78 (+0.05)
LLaVA-Critic | 0.76 | 0.79 (+0.03)
MLLM-as-a-Judge | 0.63 | 0.68 (+0.05)
MIA-Bench | 60.76 | 62.31 (+1.55)
LLaVA-Wild | 99.57 | 103.55 (+3.98)
LLaVA-Wilder | 63.75 | 71.98 (+8.23)
MM-Vet | 62.57 | 63.92 (+1.35)
MM-Vet-v2 | 55.35 | 59.11 (+3.76)
Generative RLHF-V is a novel alignment framework that integrates Generative Reward Models (GRMs) with multi-modal RLHF. It employs a two-stage pipeline: generative reward modeling from multi-modal preference and RL optimization from grouped comparison. This approach enables models to learn underlying principles from human preferences rather than just scalar scores.
Unlike traditional RLHF methods that only learn scalar scores from preferences, Generative RLHF-V enables models to learn the principles behind human preferences. Additionally, it enhances multi-modal RL scoring precision through grouped response comparison rather than individual response evaluation.
Our experimental results show that Generative RLHF-V improves 4 MLLMs' performance across 7 benchmarks by an average of 18.1%, while baseline RLHF methods only achieve a 5.3% improvement. It elevates 2B and 3B MLLMs to 7B performance levels and enables open-source models to match closed-source experts.
Reward hacking occurs when models find ways to maximize rewards without achieving the intended goals. In our research, we discovered that MLLMs can develop self-praising behaviors to deceptively receive high rewards from GRMs. This behavior is also effective in misleading MLLM-as-judge benchmarks, highlighting a significant concern in current evaluation methods.
All our code, models, and evaluation details are available on our GitHub repository and Hugging Face. You can access them via the links at the top of this page.